Abstract

Bilingual term lists are extensively used as a resource for
dictionary-based Cross-Language Information Retrieval
(CLIR), in which the goal is to find documents written in
one natural language based on queries that are expressed
in another. This paper identifies eight types of terms that
affect retrieval effectiveness in CLIR applications through
their coverage by general-purpose bilingual term lists, and
reports results from an experimental evaluation of the coverage
of 35 bilingual term lists in a news retrieval application.
Retrieval effectiveness was found to be strongly influenced
by term list size for lists that contain between 3,000
and 30,000 unique terms per language. Supplemental techniques
for named entity translation were found to be useful
with even the largest lexicons. The contribution of named
entity translation was evaluated in a cross-language experiment
involving English and Chinese. Smaller effects
were observed from deficiencies in the coverage of
domain-specific terminology when searching news stories.


1  Introduction 

The  goal  of  Cross-Language  Information  Retrieval 
(CLIR)  is  to  support  the  task  of  searching  multilingual  col¬ 
lections  by  allowing  users  to  enter  queries  in  a  language  that 
might  be  different  from  that  in  which  the  documents  are 
written.  In  dictionary-based  CLIR  techniques,  the  princi¬ 
pal  source  of  translation  knowledge  is  a  translation  lexicon 
(which  often  contains  information  extracted  from  machine- 
readable  dictionaries).  Very  simple  translation  lexicons, 
bilingual  term  lists  with  no  information  about  selectional 
preference,  are  widely  used  for  this  purpose.  Bilingual  term 
lists  are  widely  available,  but  our  experience  suggests  that 
retrieval effectiveness can vary substantially from one language
pair to another and even within a language pair, depending
on the size and quality of the bilingual term list.
The  causes  of  this  variation  are  not  yet  well  understood,  and 
our  goal  in  this  paper  is  to  explore  one  possible  cause — term 
list  coverage — using  real  bilingual  term  lists. 

The translation component of dictionary-based CLIR
techniques depends on a successful cascade of three processes:
(1) selection of the terms to be translated, (2) generation
of a set of candidate translations, and (3) use of that
set of candidate translations in the retrieval process. For the
first  stage,  the  best  results  are  typically  obtained  by  trans¬ 
lating  multiword  expressions  when  possible,  backing  off 
to  individual  words  when  necessary,  and  further  backing 
off  to  morphological  roots  when  the  surface  form  cannot 
be  found  [17].  In  the  second  stage,  algorithms  for  choos¬ 
ing  among  alternative  translations  have  been  extensively 
studied,  and  older  techniques  based  on  averaging  weights 
computed  for  each  translation  can  benefit  significantly  from 
translation  selection  based  on  term  co-occurrence  within  the 
target  corpora  [12].  The  focus  on  the  third  stage  has  been 
somewhat  more  recent,  with  the  best  presently  known  tech¬ 
nique  based  on  accumulating  term  frequency  and  document 
frequency  evidence  separately  in  the  document  language, 
then  combining  that  evidence  to  create  query-language  term 
weights  [15,  13].  What  is  less  well  understood,  however,  is 
the  effect  of  term  list  coverage.  If  a  possible  translation  is 
not  known,  it  cannot  be  selected,  and  therefore  cannot  be 
used.  What  effect  will  this  have  on  retrieval?  That  question 
is  the  focus  of  this  paper. 

We  begin  with  a  review  of  prior  controlled  studies  on 
this  topic  in  which  the  effect  of  systematically  altering  term 
list  coverage  on  retrieval  effectiveness  has  been  character¬ 
ized.  Each  of  these  studies  depended  on  artificially  ablating 
the  coverage  in  some  manner,  and  to  the  best  of  our  knowl¬ 
edge  none  of  these  approaches  have  been  validated  through 
comparison  with  naturally  occurring  bilingual  term  lists  of 
various sizes. In Section 3 we address this concern by assessing
the effect of coverage deficiencies in actual bilingual
term lists obtained from several sources. Our results
indicate  that  retrieval  effectiveness  is  indeed  positively  cor¬ 
related  with  term  list  size,  but  that  term  lists  with  similar 
sizes  sometimes  yield  substantially  different  retrieval  effec¬ 
tiveness.  We  therefore  examine  one  term  list  in  greater  de¬ 
tail  in  Section  4,  identifying  eight  types  of  missing  terms. 
Named  entities  are  found  to  be  by  far  the  most  common 
type  of  missing  term  in  news  stories,  so  we  examine  the  ef¬ 
fect  of  named  entity  translation  in  detail  in  Section  5.  With 
this  as  background,  we  then  describe  the  known  techniques 
for  accommodating  coverage  deficiencies  when  building 
CLIR  systems  and  recommend  topics  for  further  research 
that  seem  particularly  promising  in  light  of  the  insights  that 
we  have  obtained  through  the  experiments  reported  in  this 
paper. 

2  Background 

We  began  our  exploration  of  this  question  several  years 
ago  with  a  simple  experiment  in  which  we  used  two  bilin¬ 
gual  term  lists  from  different  sources  to  measure  the  ef¬ 
fect  of  the  linguistic  resource  on  the  effectiveness  of  cross¬ 
language  information  retrieval  [8].  We  obtained  what 
seemed  like  a  counterintuitive  result:  the  smaller  term  list 
(with  30,322  unique  English  terms)  actually  did  some¬ 
what  better  than  the  larger  list  (with  89,003  unique  English 
terms),  although  the  difference  was  neither  large  nor  statisti¬ 
cally  significant.  We  tried  combining  the  two  lists  (resulting 
in 97,603 unique English terms), obtaining an effectiveness
measure between those of the small and the large lists.
Clearly,  the  size  of  the  list  did  not  tell  the  whole  story  and 
the  key  question  was  not  whether  you  know  a  lot  of  transla¬ 
tions,  but  whether  you  know  the  right  ones! 

This  first  study  of  ours  had  been  inspired  in  part  by  a 
paper in which Grefenstette had suggested several alternative
coverage measures for assessing the
utility of a translation lexicon in CLIR applications. For
example,  he  had  calculated  that  the  English  portion  of  the 
relatively  large  (37,600  entry)  ELRA  Basic  Multilingual 
Lexicon  covered  common  terms  quite  well,  with  97%  of 
the  1,000  most  common  English  words  being  found  (af¬ 
ter  splitting  multiword  expressions,  conflating  inflectional 
variants,  and  excluding  proper  names)  [6].  Less  common 
words  seemed  more  problematic,  however,  with  only  51% 
of  the  most  common  50,000  English  words  found  in  the 
lexicon.  We  therefore  developed  some  alternative  coverage 
measures  based  on  different  ways  of  looking  at  term  impor¬ 
tance.  We  were,  however,  unable  to  find  a  measure  that  was 
positively  correlated  with  retrieval  effectiveness. 

A subsequent ablation study by Xu and Weischedel began
to shed some light on this counterintuitive result [20].
Starting with an 80,000-word English-Chinese bilingual
term  list,  they  simulated  the  effect  of  smaller  term  lists  by 
progressively  removing  terms  from  the  list  in  order  of  in¬ 
creasing  frequency  of  occurrence  in  a  large  collection  of 
English  documents.  They  found  that  mean  average  preci¬ 
sion  increased  with  the  size  of  the  lexicon,  but  reached  a 
plateau  once  translations  for  the  most  common  20,000  En¬ 
glish  words  were  included.  If  this  ablation  process  accu¬ 
rately  models  the  way  in  which  real  bilingual  term  lists  are 
formed,  then  a  plausible  explanation  for  the  results  we  ob¬ 
served  in  our  first  study  would  be  that  the  added  terms  in 
the  larger  bilingual  term  list  were  rare,  and  thus  not  likely 
to  be  observed  in  any  particular  set  of  queries. 
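
The frequency-ordered ablation described above can be sketched as follows. This is only an illustration of the idea, assuming that term frequency is counted over a tokenized English corpus; the function name `ablate_term_list` and its parameters are hypothetical and are not taken from [20].

```python
from collections import Counter


def ablate_term_list(term_list, corpus_tokens, keep):
    """Simulate a smaller term list by keeping the `keep` most frequent terms.

    Removing terms in order of increasing corpus frequency is equivalent to
    retaining the most frequent ones. Ties are broken alphabetically so the
    result is deterministic (an assumption of this sketch).
    """
    freq = Counter(corpus_tokens)  # terms absent from the corpus count as 0
    ranked = sorted(term_list, key=lambda term: (-freq[term], term))
    return set(ranked[:keep])


# Hypothetical toy data, for illustration only.
terms = {"the", "fire", "sydney", "thoracoscope"}
corpus = "the the the fire fire sydney".split()
small = ablate_term_list(terms, corpus, keep=2)
```

Under this model the rarest terms are the first to disappear, which is the assumption that the remainder of this paper examines against naturally occurring term lists.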

That was the point of departure for the work reported
in this paper. Xu and Weischedel’s approach seemed to
offer a useful insight into coverage effects, but before we
could generalize from that work we needed some insight
into whether their ablation model captured what really happened
when people built bilingual term lists. After completing
our work, we learned of a concurrent study by McNamee
and Mayfield that shed additional light on this question
[10]. They tried an approach similar to that of Xu
and Weischedel, but with two key differences: (1) the selection
of terms for which translation was suppressed was
made randomly with a uniform distribution, and (2) corpus-based
pre-translation expansion was used to enrich the set
of  terms  to  be  translated.  With  no  pre-translation  expan¬ 
sions,  they  observed  fairly  consistent  declines  in  retrieval 
effectiveness  with  coverage  ablation,  even  with  relatively 
large  conditions.  This  tends  to  confirm  our  intuition  that  the 
way  in  which  coverage  ablation  is  modeled  is  consequen¬ 
tial.  McNamee  and  Mayfield’s  most  important  result,  how¬ 
ever,  is  that  much  of  the  lost  effectiveness  can  be  regained 
using  corpus-based  pre-translation  expansion,  because  the 
effect  of  expansion  increases  markedly  at  higher  ablation 
levels.  We  did  not  use  pre-translation  expansion  for  the  ex¬ 
periments  reported  in  this  paper,  and  it  is  now  quite  clear 
that  this  will  be  an  important  area  for  further  work. 

3  Characterizing  Term  List  Coverage 

We  obtained  from  the  Internet  34  freely  distributed  bilin¬ 
gual  term  lists  that  each  pair  English  with  one  of  24  other 
languages,  and  we  extracted  a  35th  bilingual  term  list 
from  a  large  machine-readable  bilingual  dictionary  (see  Ap¬ 
pendix  A  for  a  listing).  The  smallest  of  these  term  lists 
(English-Eskimo)  contains  700  unique  English  terms — the 
largest  (an  English-Chinese  term  list  extracted  from  the 
machine-readable dictionary) contains 193,297 unique English
terms. Although that set contains a good spread of
sizes  (measured  as  the  number  of  unique  English  terms),  no 
single  language  pair  is  represented  by  more  than  four  term 
lists. Gaining the sort of insight that we sought using multiple
language pairs would be difficult, however, because coverage
effects might well be masked by other factors (e.g.,
differential  importance  and/or  effectiveness  of  compound 
splitting  and  morphological  normalization).  We  therefore 
chose  to  focus  only  on  the  English  side  of  each  term  list. 

There  are  three  basic  approaches  to  CLIR:  translate  the 
query  into  the  language  of  the  document  collection  [7]; 
translate  the  documents  into  the  language  of  the  queries  [9]; 
or  create  a  language-neutral  representation  of  both  the 
queries and the documents (cf. [4]). We chose to model a
query  translation  process,  and  to  suppress  effects  other  than 
coverage  by  simulating  translation  from  English  to  English 
in  a  way  that  was  sensitive  to  the  English-language  coverage 
of  each  term  list. 

For  our  experiments,  we  used  an  information  retrieval 
test  collection  from  the  Cross-Language  Evaluation  Forum 
(CLEF  2000).  The  collection  contains  113,000  English 
news  stories  from  the  Los  Angeles  Times  (about  435  MB 
of  text),  33  English  topic  descriptions,1  and  binary  (yes-no) 
relevance judgments for topic-document pairs.

We  used  this  monolingual  test  collection  with  each  spe¬ 
cific  bilingual  term  list  to  simulate  CLIR  in  the  following 
manner: 

•  English  queries  are  formed  using  every  word  in  the  ti¬ 
tle  and  description  fields  of  the  topic  description.  This 
is  typically  a  sentence  or  two,  representative  of  how 
an  information  need  might  initially  be  expressed  to  a 
human  intermediary  that  is  helping  with  the  search. 
We  repeated  our  experiments  with  shorter  queries  built 
from  the  title  field  alone  (which  are  designed  to  be 
representative  of  what  a  searcher  might  type  into  a 
Web  search  engine),  obtaining  similar  results.  Because 
the  observed  coverage  effects  were  very  similar  for 
both  sets  of  queries,  we  present  results  only  for  the 
title+description  queries  in  this  paper. 

• Any English word that does not appear on the English
side of the bilingual term list is removed from
the query. We refer to this process as “filtering” the
query using the term list. Resnik et al. observed that
bilingual term lists found on the Internet often contain
an eclectic mix of root forms and morphological
variants, and proposed a backoff translation strategy in
which English words with the same stems were conflated
prior to translation [17]. This achieves an effect
similar to McNamee and Mayfield’s pre-translation expansion,
but the key difference is that the conflation
is performed only if the surface form of the word to
be translated is not found, which limits the introduction
of spurious translations when good ones are already
known. We modeled backoff translation by including
an alternate condition in which English terms in the
bilingual term list were reduced to their stems using
the Porter stemmer and added to the original list before
matching. Because our English-English coverage
assessment includes no actual translation, we modeled
this as a single step with no actual backoff, but we
nonetheless refer to it as the “backoff” condition.

1 The CLEF 2000 collection contains 40 topics, but no relevant English
documents are known for topics 2, 6, 8, 23, 25, 27, and 35, so they were
excluded from our experiments because they cannot distinguish between
the conditions that we wished to explore.

•  We  use  English  as  a  surrogate  for  the  second  language, 
so  we  do  not  actually  translate  the  filtered  query  in 
this  experiment.  This  can  be  thought  of  as  modeling 
a  case  in  which  translation  of  known  terms  is  perfect 
and  in  which  the  handling  of  target-language  terms  is 
consistent.  Since  this  will  obviously  sometimes  not  be 
the  case  in  real  applications,  our  method  clearly  com¬ 
putes  only  upper  bounds  on  retrieval  effectiveness — 
we  would  expect  the  results  in  actual  analogous  end- 
to-end  CLIR  applications  to  be  lower.  But  by  holding 
these  other  factors  constant,  we  are  able  to  focus  more 
sharply  on  coverage  effects. 

•  We  search  the  English  document  collection  using  the 
InQuery text retrieval system and the filtered query. InQuery
is a state-of-the-art system that ranks documents
in  decreasing  order  of  likely  relevance.  We  then  com¬ 
pute  mean  average  precision,  a  commonly  used  mea¬ 
sure  of  comparative  retrieval  effectiveness  that  reflects 
the  expected  density  of  relevant  documents  near  the 
top  of  the  ranked  list  [19].  Mean  average  precision  as¬ 
sumes  values  between  zero  and  one  (with  higher  values 
preferred).  We  performed  a  second  set  of  experiments 
using  a  vector-space  text  retrieval  system  (MG)  with 
similar  results,  so  we  are  reasonably  confident  that  the 
results  reported  in  this  paper  are  not  overly  dependent 
on  the  details  of  the  design  of  the  retrieval  system. 
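
The filtering step and its backoff variant can be illustrated with a short sketch. It reflects one possible reading of the procedure above: a query word is kept if its surface form is on the English side of the term list or, in the backoff condition, if its stem matches the stem of some listed term. The suffix stripper `toy_stem` is a deliberately crude stand-in for the Porter stemmer, and all names here are hypothetical.

```python
def toy_stem(word):
    """Crude suffix stripper used only as a stand-in for the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def filter_query(query_terms, english_side, backoff=False, stem=toy_stem):
    """Drop query terms that the term list does not cover.

    With backoff=True, the stems of the listed terms are added to the
    lookup set, so a query term is also kept when its stem matches.
    """
    known = set(english_side)
    known_stems = {stem(term) for term in known} if backoff else set()
    kept = []
    for term in query_terms:
        if term in known or (backoff and stem(term) in known_stems):
            kept.append(term)
    return kept


# Hypothetical toy data, for illustration only.
english_side = ["fire", "burning", "near"]
query = ["fires", "burn", "near", "sydney"]
plain = filter_query(query, english_side)
with_backoff = filter_query(query, english_side, backoff=True)
```

In this toy case the plain condition keeps only `near`, while the backoff condition also recovers `fires` and `burn`; the named entity `sydney` is lost in both.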

In  our  experiments,  the  principal  independent  variable  is 
the  number  of  unique  English  terms  in  the  bilingual  term 
list  (which  we  plot  on  the  X  axis),  and  the  principal  depen¬ 
dent  variable  is  mean  average  precision  (on  the  Y  axis).  We 
used a two-tailed t-test, pairing observations of mean average
precision values by bilingual term list size, to test the
statistical  significance  of  observed  differences. 
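
For readers unfamiliar with the dependent variable, the following sketch shows the usual uninterpolated definition of average precision and its mean over topics. It assumes binary relevance judgments and a ranked list without duplicate documents; it is not the evaluation code used in our experiments, and the function names are illustrative.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Uninterpolated average precision for one topic."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("topic has no known relevant documents")
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical toy data, for illustration only.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
]
score = mean_average_precision(runs)
```

Topics with no known relevant documents are rejected here, which mirrors the exclusion of such topics from the experiments.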

As  Figure  1  shows,  the  mean  average  precision  for  bilin¬ 
gual  term  lists  of  similar  size  exhibits  quite  a  lot  of  vari¬ 
ation.  Simulating  backoff  translation  typically  results  in 
greater  retrieval  effectiveness  for  any  size  term  list  (statis¬ 
tically  significant  at  p  <  0.001),  and  in  less  variation  in 
retrieval  effectiveness  across  the  set  of  relatively  large  term 
lists.  This  clearly  indicates  that  proper  handling  of  morpho¬ 
logical  variants  is  an  important  issue  for  dictionary-based 
CLIR. Most of the smallest bilingual term lists (up to about
3,000 unique English terms) are of little use for CLIR. Approximately
linear growth in mean average precision is evident
between about 3,000 and 20,000 terms, with little further
improvement observed beyond that range. These observations
are consistent with Xu and Weischedel’s assumption
that small term lists predominantly contain very common
terms (to which InQuery gives little weight) and that the additional
terms present in the largest term lists are so rarely
used that they are very unlikely to be present in a query, and
therefore tend to support their model.

Figure 1. Effects of lexicon size (X axis: bilingual term
list size). Upper area (triangles) with backoff translation,
lower area (circles) without.

In  this  case,  the  original  queries  with  no  terms  removed 
achieve  a  mean  average  precision  of  about  0.4.  Interest¬ 
ingly,  even  when  backoff  translation  is  simulated,  many  of 
the  largest  bilingual  term  lists  yield  a  mean  average  preci¬ 
sion  value  that  is  15-20%  below  that.  Moreover,  mean  aver¬ 
age  precision  aggregates  the  effect  over  many  queries — the 
effect  on  average  precision  for  individual  queries  is  even 
more dramatic. For example, topic 17 (Bush fire near Sydney)
achieves an average precision of 0.67 with one term
list,  but  0.32  with  another  of  similar  size.  In  the  next  sec¬ 
tion,  we  explore  the  question  of  what  is  missing  from  bilin¬ 
gual  term  lists  found  on  the  Internet  that  might  be  important 
in  CLIR  applications. 

4  The  Missing  Terms 

Our  goal  in  this  section  is  to  examine  a  representative 
set  of  untranslatable  terms  that  might  appear  in  queries  in 
order to focus our future efforts to improve dictionary coverage.
When assessing coverage effects using a test collection,
we can only see the effect of terms that happened
to be present in some query. But we can gain access to
a  far  larger  set  of  terms  that  might  be  included  in  a  query 
by  examining  the  documents  rather  than  the  queries.  We 
therefore  chose  to  examine  terms  from  two  document  col¬ 
lections  in  order  to  explore  the  reasons  for  coverage  failures. 
We chose to study our largest non-Asian bilingual term list
(German-English  term  list  number  33  in  Appendix  A)  be¬ 
cause  comparable  collections  were  available  for  both  lan¬ 
guages.  For  the  English  analysis,  we  used  the  CLEF  2000 
English  collection  (described  above).  For  German,  we  used 
the  CLEF  2000  German  collection,  which  contains  153,499 
stories  (301  MB)  published  by  the  magazine  Der  Spiegel 
and the Frankfurter Rundschau newspaper in 1994.

The English collection was stemmed using the Porter stemmer.
The English collection contains over 60 million tokens
after stemming and stopword removal, approximately
7% of which do not appear in the term list. German compounds
were split using greedy left-to-right longest-substring
matching against the German side of the bilingual
term list. The German collection contains approximately
50  million  tokens,  approximately  17%  of  which  do  not  ap¬ 
pear  in  the  term  list.  In  each  case,  the  missing  terms  were 
collected  in  a  list  and  duplicates  were  removed.  The  first  au¬ 
thor  of  this  paper  then  examined  a  randomly  selected  sam¬ 
ple  of  1,000  words  from  each  list,  and  grouped  the  terms 
into  categories.  Figure  2  illustrates  the  observed  distribu¬ 
tion  of  terms  in  the  following  categories: 


Figure  2.  Distribution  of  the  out-of-vocabulary 
words  in  the  CLEF  2000  collection. 


Named entities, which include proper nouns, locations,
brand names, etc. For example, the absence of Sydney
from some term lists is what resulted in the extreme
variability in average precision for the Bush fire near
Sydney topic. Named entities comprise almost half the
missing terms.

General  vocabulary,  defined  as  words  that  would  be  ex¬ 
pected  to  be  found  in  a  comprehensive  printed  mono¬ 
lingual  dictionary  without  an  annotation  that  its  usage 
is  restricted  to  a  particular  domain.  For  example,  the 
general  vocabulary  term  firefighter  was  missing  from 
the  English  side  of  most  bilingual  term  lists.  Because 
missing  German  words  were  aggressively  split  to  find 
known  words  but  missing  English  words  were  not, 
general  vocabulary  is  more  often  found  to  be  missing 
in  English. 

Newly formed words, which are built from other words
(typically as compound terms). For example, cyberwalk
was found in the English collection, but that term
would not likely appear in any dictionary. Some German
compounds could not be split because one constituent
was missing from the term list (e.g., in the compound
German term gruselstory (horror story), story is
a loan word from English that does not appear on the
German side of the bilingual term list).

Alternate  spellings,  some  of  which  are  typographical  er¬ 
rors,  and  others  of  which  result  from  regular  variations 
within  a  language.  For  example,  the  British  spelling 
privatisation  appears  in  one  query,  but  the  Los  Angeles 
Times  news  stories  contain  only  the  American  spelling 
privatization.  We  chose  to  group  these  two  cases  be¬ 
cause  we  expected  at  the  time  of  this  study  that  similar 
methods  (e.g.,  fuzzy  matching)  might  be  used  to  deal 
with  them. 

Domain-specific  terminology,  which  would  not  be  ex¬ 
pected  to  be  present  in  broad-coverage  lexical  re¬ 
sources.  For  example,  the  term  thoracoscope  would 
be  expected  to  appear  only  in  term  lists  specialized  to 
the  medical  domain. 

Abbreviations,  which  might  either  be  acronyms  or  short¬ 
ened  forms  of  a  term.  For  example,  the  abbreviation 
itrl  did  not  appear  on  either  side  of  the  bilingual  term 
list. 

Loan  words,  which  are  adopted  from  another  language 
with  no  change  in  meaning,  but  perhaps  with  minor 
variations  in  spelling.  For  example,  the  Russian  term 
glasnost  appeared  in  the  English  collection.  Loan 
words  were  considerably  more  common  in  German 
than  in  English,  perhaps  reflecting  the  pervasive  influ¬ 
ence  of  American  media  on  adoption  of  terms  in  other 
cultures. 

Transcribed sounds, which are used to convey colloquialisms
or to imitate sounds. For example, the term
aaaarrff was found in the German collection.


Undecidable,  a  category  that  was  used  to  code  any  term 
that  could  not  be  reliably  placed  in  another  category. 
Terms were normalized to lower case and removed
from their context before being examined, and sometimes
this precluded accurate categorization. For example,
simeone might have been a misspelling of someone,
or it might have been the name of a person.

As Figure 2 clearly shows, named entities are by far the
most  common  type  of  missing  term.  We  therefore  elected  to 
study  their  impact  on  retrieval  effectiveness  in  more  detail 
in  the  next  section. 
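
The greedy left-to-right longest-substring compound splitting used for the German collection earlier in this section can be sketched as follows. The minimum constituent length `min_len` and the decision to give up (returning `None`) when no listed constituent matches are assumptions of this sketch, and the toy lexicon is hypothetical.

```python
def split_compound(word, lexicon, min_len=3):
    """Split `word` greedily into the longest known constituents, left to right.

    Returns the list of constituents, or None if some remainder of the
    word cannot be matched against the lexicon.
    """
    parts = []
    pos = 0
    while pos < len(word):
        # Try the longest candidate first, shrinking from the right.
        for end in range(len(word), pos + min_len - 1, -1):
            if word[pos:end] in lexicon:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            return None  # no known constituent starts at this position
    return parts


# Hypothetical toy lexicon standing in for the German side of a term list.
lexicon = {"grusel", "geschichte", "feuer", "wehr", "feuerwehr", "mann"}
ok = split_compound("feuerwehrmann", lexicon)
failed = split_compound("gruselstory", lexicon)
```

Here `feuerwehrmann` splits into `feuerwehr` and `mann`, while `gruselstory` cannot be split because `story` is absent from the lexicon, as in the example above. A purely greedy strategy can also fail on words that a backtracking splitter would handle.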

5  The  Impact  of  Named  Entities 

Early  work  on  dictionary-based  approaches  to  CLIR  in 
European  languages  generally  showed  relatively  little  ad¬ 
verse  effect  from  omission  of  named  entities.  When  per¬ 
forming  CLIR  among  European  languages,  the  usual  ap¬ 
proach  is  to  retain  untranslatable  terms  unchanged  (perhaps 
with  the  omission  of  accents  or  other  diacritics),  and  named 
entities  such  as  the  names  of  persons  were  often  written 
the  same  way  in  both  languages.  Once  experimentation  ex¬ 
tended  to  language  pairs  with  different  character  sets,  how¬ 
ever,  the  magnitude  of  the  problem  became  clear.  Since  our 
goal  is  to  characterize  coverage  effects,  we  have  chosen  to 
explore  three  conditions:  (1)  retain  all  named  entities,  (2) 
retain  only  those  named  entities  that  appear  in  the  English 
side  of  the  bilingual  term  list,  and  (3)  retain  no  named  enti¬ 
ties  (even  if  they  appear  in  the  term  list).  In  each  case,  we 
retain  all  terms  that  are  not  named  entities  if  and  only  if  they 
are  present  in  the  term  list.  The  first  condition  simulates  the 
case  in  which  named  entity  translation  is  perfect  (e.g.,  when 
string  matching  between  European  languages  works).  The 
second  simulates  the  cross-character  set  condition  (with  no 
augmentation  from  transliteration),  and  is  identical  to  the 
condition  reported  in  Section  3.  The  third  is  a  contrastive 
condition  that  is  designed  to  provide  a  reference  point  for 
the  other  two. 
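
The three conditions can be expressed as a small variation on query filtering. This is a sketch under the assumption that named entities have already been tagged by hand; the condition labels `"all"`, `"if_listed"`, and `"none"` are names invented for this illustration.

```python
def filter_with_named_entities(query_terms, named_entities, english_side, condition):
    """Filter a query under one of three named entity conditions.

    "all":       retain every named entity (perfect named entity translation)
    "if_listed": retain a named entity only if the term list contains it
    "none":      retain no named entities, even those in the term list
    Other terms are retained if and only if the term list contains them.
    """
    if condition not in ("all", "if_listed", "none"):
        raise ValueError("unknown condition: " + condition)
    known = set(english_side)
    entities = set(named_entities)
    kept = []
    for term in query_terms:
        if term in entities:
            if condition == "all" or (condition == "if_listed" and term in known):
                kept.append(term)
        elif term in known:
            kept.append(term)
    return kept


# Hypothetical toy data, for illustration only.
query = ["fire", "near", "sydney", "australia", "firefighter"]
entities = {"sydney", "australia"}
english_side = {"fire", "near", "australia"}
results = {c: filter_with_named_entities(query, entities, english_side, c)
           for c in ("all", "if_listed", "none")}
```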

We hand-tagged each named entity in the 33 title+description
queries (22 queries actually contained
named entities) and then repeated the experiments described
in  Section  3.  As  before,  we  performed  this  experiment  us¬ 
ing  the  English  side  of  all  of  the  term  lists  and  only  the 
CLEF  2000  English  test  collection.  We  obtained  similar  re¬ 
sults  with  and  without  simulating  backoff  translation,  so  we 
show  only  the  backoff  condition  in  Figure  3.  The  trend  in 
the  first  and  third  conditions  is  quite  clear,  with  term  lists 
that  contain  more  than  30,000  unique  English  terms  almost 
always  achieving  maximal  retrieval  effectiveness.  The  solid 
horizontal  line  at  0.4  shows  the  results  of  a  full  monolin¬ 
gual  query,  and  the  dashed  line  at  0.225  shows  the  results 
that would be achieved by removing all (and only) named
entities from the queries. Figure 3 reveals a sigmoidal
shape,  with  little  benefit  from  term  lists  that  contain  fewer 
than  3,000  unique  English  terms,  nearly  linear  improvement 
with  increasing  coverage  between  3,000  and  30,000  unique 
terms,  and  little  benefit  from  further  increases  beyond  that 
size.  From  examining  the  middle  condition,  it  seems  clear 
that  much  of  the  observed  variation  reported  in  Section  3 
resulted  from  differences  in  the  coverage  of  named  entities. 


Figure  3.  Impact  of  named  entities  in  simu¬ 
lated  CLIR.  Squares:  always  used;  Triangles: 
never  used;  Dotted  line:  used  if  present  in 
term  list. 

We  also  ran  one  full  CLIR  experiment  to  see  the  effect 
of  named  entity  handling  in  actual  practice.  We  chose  the 
English-Chinese  language  pair  for  those  experiments  be¬ 
cause  that  allowed  us  to  use  our  largest  available  bilingual 
term  list  (Number  35  in  Appendix  A).  For  this  experiment, 
we  used  the  TREC  5-6  Mandarin  Corpus  of  approximately 
170  megabytes  of  articles  drawn  from  the  People’s  Daily 
newspaper  and  the  Xinhua  newswire.  That  test  collection 
contains  54  Chinese  topics,  for  which  both  English  trans¬ 
lations  and  relevance  judgments  are  available.  We  hand 
tagged  all  named  entities  in  the  English  translations,  and 
each  was  then  manually  translated  into  Chinese  by  a  native 
speaker  of  Chinese.  The  same  native  speaker  also  hand- 
segmented  the  Chinese  query  terms  for  use  in  a  contrastive 
monolingual  Chinese  run. 

We used Pirkola’s method [15] to structure the translated
queries and used the InQuery text retrieval system.
This  has  the  effect  of  estimating  the  within-document  term 
frequency  for  each  query  term  in  each  document  as  the  sum 
of the frequencies of any translation of the query term in that
document, and the collection-wide “document frequency”
of  a  term  as  the  number  of  documents  in  which  any  trans¬ 
lation  of  that  term  occurs.  The  weight  for  each  query  term 
is  thus  computed  in  the  document  language — this  is  now 
widely  accepted  as  a  good  approach  when  translation  prob¬ 
ability  information  is  not  available.  The  rest  of  the  ex¬ 
periment  design  followed  the  monolingual  experiment  de¬ 
scribed  in  Section  3,  with  three  conditions:  (1)  all  named 
entities  manually  translated,  (2)  named  entities  translated 
only  if  present  in  the  term  list,  and  (3)  named  entities  never 
translated.  In  each  case,  terms  other  than  named  entities 
were  translated  in  whatever  way  the  bilingual  term  list  spec¬ 
ified.  For  contrast,  we  also  performed  an  ordinary  (all  terms 
retained)  monolingual  Chinese  run  and  a  run  with  no  bilin¬ 
gual  term  list  but  all  named  entities  manually  translated. 
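
The term frequency and document frequency estimates that result from structuring queries in this way can be sketched as follows. This illustrates only the counting described above, not InQuery’s operators or its weighting formula; documents are assumed to be pre-segmented token lists, and the function names are hypothetical.

```python
from collections import Counter


def structured_tf(doc_tokens, translations):
    """Within-document frequency of a query term: the summed frequency
    of all of its known translations in the document."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in set(translations))


def structured_df(collection, translations):
    """Document frequency of a query term: the number of documents
    that contain at least one of its known translations."""
    alternatives = set(translations)
    return sum(1 for doc in collection if alternatives & set(doc))


# Hypothetical toy collection; "a" and "b" stand for two translations
# of the same query term.
collection = [["a", "b", "a"], ["b", "c"], ["d"]]
tf_first_doc = structured_tf(collection[0], ["a", "b"])
df = structured_df(collection, ["a", "b"])
```

Because both statistics are accumulated over the whole set of translations, a rare correct translation is not swamped by being averaged with a common incorrect one, and a query term with no known translation simply receives zero for both.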

Figure  4  shows  the  results.  In  this  case,  the  bilingual 
term  list  contained  many  named  entities,  and  suppressing 
their  translation  hurt  substantially  (over  60%  reduction  in 
mean  average  precision,  statistically  significant).  Further 
improvement  appeared  to  result  from  manually  translating 
all  named  entities,  but  the  advantage  over  using  the  term  list 
alone  was  not  found  to  be  statistically  significant.  Taken 
together,  these  results  seem  reasonable,  since  the  bilingual 
term  list  that  we  chose  for  this  experiment  is  relatively  rich 
in  named  entities.  Interestingly,  manually  translating  only 
the  named  entities  (with  no  bilingual  term  list)  did  nearly 
as  well  as  using  the  bilingual  term  list  alone  (with  no  man¬ 
ual  translation  of  named  entities).  From  this  we  conclude 
that  the  results  of  our  one  actual  cross-language  experiment 
tend  to  support  the  results  we  obtained  with  single-language 
coverage  measurements  using  a  broader  range  of  bilingual 
term  lists. 

6  Accommodating  Coverage  Deficiencies 

In  this  section,  we  briefly  review  techniques  that  have 
been  used  to  overcome  deficiencies  in  the  coverage  of  bilin¬ 
gual  term  lists  in  CLIR  systems.  With  that  as  background, 
we  suggest  further  work  on  one  additional  technique,  trans¬ 
lation  extraction  from  comparable  corpora,  that  has  been 
explored  as  an  abstract  problem  in  computational  linguis¬ 
tics  but  not  yet  applied  to  CLIR. 

Two  broad  classes  of  approaches  have  been  tried  to  han¬ 
dle  named  entities  that  do  not  appear  in  bilingual  term  lists. 
The  first,  widely  used  when  both  languages  are  expressed  in 
the  same  writing  system,  is  to  retain  the  untranslated  term 
(or  perhaps  to  strip  accents,  if  accents  are  commonly  used 
in  only  one  of  the  two  languages).  In  addition  to  matching 
names  that  are  written  identically,  this  also  generates  felic¬ 
itous  matches  on  loan  words,  which  can  be  an  important 
factor  if  cultural  factors  have  resulted  in  significant  sharing 
over time within the language pair. When the languages
use different writing systems, the second approach, phonetic
transliteration, provides a useful way to achieve similar
effects (cf. [1]).
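
The first approach, retaining the untranslated term with accents stripped, can be sketched in a few lines. The function names and the dictionary-based term list are assumptions made for illustration; letters that do not decompose under Unicode normalization (for example "ß" or "ø") would need language-specific handling that is not shown here.

```python
import unicodedata


def strip_accents(term):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def translate_or_retain(term, term_list):
    """Use the term list when possible; otherwise keep the term itself.

    term_list: hypothetical dict mapping a source term to a list of
    translations.
    """
    if term in term_list:
        return term_list[term]
    return [strip_accents(term)]
```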

Figure 4. The effect of named entities in actual
CLIR.


Our  results  suggest  that  three  other  relatively  straight¬ 
forward  techniques  should  receive  more  attention  than  they 
have  to  date.  The  first  is  decompounding,  which  is  clearly 
of  value  in  a  freely  compounding  language  such  as  Ger¬ 
man,  but  which  might  also  be  helpful  in  some  cases  in  lan¬ 
guages  such  as  English  where  the  technique  is  rarely  ap¬ 
plied.  Since  decompounding  is  designed  to  enhance  recall, 
possibly  at  the  expense  of  precision,  a  backoff  strategy  in 
which  decompounding  is  only  tried  when  the  original  term 
is  not  found  is  probably  a  good  idea.  The  second  technique 
is  normalization  of  alternative  spellings,  which  is  a  variant 
of the spelling correction problem (cf. [3]). The beneficial
effect  of  normalizing  alternative  spellings  may  not  be  easy 
to  see  using  standard  test  collections  because  most  such  col¬ 
lections  are  built  using  carefully  edited  news  or  journal  ar¬ 
ticles  with  carefully  prepared  queries.  But  many  real  ap¬ 
plications  (e.g.,  Web  searching)  are  considerably  messier. 
In the one case we observed in our collection (privatisation
vs. privatization), normalization was helpful, at least with
larger  term  lists  that  were  likely  to  contain  the  normalized 
form. Third, special handling for abbreviations and transcribed
words also seems to merit consideration. In many
cases,  we  expect  that  language-specific  heuristics  will  be 
needed,  since  the  process  of  abbreviation  exhibits  consid¬ 
erable  variation  across  languages  and  the  transcription  of 
sounds  naturally  depends  on  how  people  speak. 
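
The backoff strategy for decompounding mentioned above might look like the following sketch. It assumes a term list represented as a dictionary and a simple longest-prefix split; the function names, the minimum part length, and the sample entries are hypothetical, and linking elements (such as the German "s") are not handled.

```python
def decompound(term, term_list, min_part=3):
    """Split a term into parts that all appear in the term list.

    Tries the longest known prefix first and recurses on the remainder.
    Returns an empty list when no complete split is found.
    """
    if term in term_list:
        return [term]
    for i in range(len(term) - min_part, min_part - 1, -1):
        head, tail = term[:i], term[i:]
        if head in term_list:
            rest = decompound(tail, term_list, min_part)
            if rest:
                return [head] + rest
    return []


def translate_with_backoff(term, term_list):
    """Look the term up directly; decompound only when that fails."""
    if term in term_list:
        return term_list[term]
    parts = decompound(term, term_list)
    return [t for part in parts for t in term_list[part]]


# Hypothetical German-English entries for illustration only.
term_list = {
    "fuss": ["foot"],
    "ball": ["ball"],
    "welt": ["world"],
    "meister": ["champion", "master"],
}
```

Trying the exact lookup first means that a compound with its own entry is never split, so decompounding can add recall only where the term list would otherwise contribute nothing.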

Of  the  two  remaining  categories,  general  vocabulary  was 
more  commonly  missing  than  domain-specific  vocabulary. 
This  statistic  can  be  misleading,  however,  for  two  reasons. 
First,  it  is  an  artifact  of  the  genre  of  our  test  collection — 
news  stories.  In  a  collection  of  medical  journal  articles,  for 
example,  we  would  expect  domain-specific  terminology  to 
be  far  more  prevalent  than  it  is  in  news  stories.  The  sec¬ 
ond  factor  is  that  domain-specific  terminology  tends  to  be 
quite specific, and thus quite highly weighted by information
retrieval systems. Finding larger bilingual term lists
(up  to  approximately  30,000  unique  terms)  seems  to  be  an 
effective  way  of  increasing  the  coverage  of  general  vocabu¬ 
lary,  but  unless  specialized  term  lists  are  used,  that  approach 
would  likely  not  do  much  to  identify  translations  for  miss¬ 
ing  domain-specific  vocabulary.  For  this  reason,  in  the  re¬ 
mainder  of  this  section,  we  explore  the  potential  for  learning 
translations  of  domain-specific  terminology  in  another  way. 

Techniques  for  learning  translations  from  parallel  text 
collections (i.e., collections of translation-equivalent document
pairs) have been widely studied (cf. [11]), but
domain-specific  parallel  text  collections  have  proven  to  be 
difficult  to  obtain  in  many  practical  applications.  For  this 
reason,  some  researchers  in  computational  linguistics  have 
explored  ways  in  which  partial  knowledge  of  possible  trans¬ 
lations  is  used  to  learn  translations  for  additional  terms  from 
topically-related  text  collections  (or  “comparable  corpora”) 
in  each  language.  Comparable  corpora  are  typically  eas¬ 
ier  to  obtain  than  parallel  corpora  because  different  sources 
could  provide  the  collections  for  each  language. 

The  basic  approach  to  learning  new  translations  in  this 
way  is  modeled  to  some  extent  on  the  way  in  which  hu¬ 
mans  acquire  vocabulary  by  reading — the  context  in  which 
a  term  is  used  gives  a  clue  about  its  meaning.  The  key  idea 
for  using  comparable  corpora  to  learn  new  translations  is  to 
start  with  an  incomplete  bilingual  term  list,  use  the  known 
translation  relationships  to  discover  regions  in  the  two  col¬ 
lections  that  have  a  similar  pattern  of  word  use  (and  hence  a 
similar  topical  focus),  and  then  hypothesize  translation  re¬ 
lationships  between  any  terms  that  lack  a  known  translation 
relationship  but  that  repeatedly  appear  together  in  regions 
that  are  paired  in  this  way.  The  process  can  then  be  iterated 
to  further  improve  coverage.  Three  variants  on  this  basic 
idea  have  been  tried  by  Rapp  [16],  Fung  [5],  and  Picchi  and 
Peters  [14].  Every  technique  that  has  been  tried  relies  on 
some  way  of  estimating  the  importance  of  individual  terms 
and  then  combining  those  estimates  to  evaluate  term  impor¬ 
tance.  Fung  adapted  measures  used  in  information  retrieval, 
while Picchi and Peters, like Rapp, tried similar ways of
comparing  observed  and  expected  frequencies  to  estimate 
the  information  content  of  an  observed  co-occurrence.  Slid¬ 
ing  windows  are  commonly  used  to  limit  the  extent  of  the 
regions  that  are  candidates  for  alignment. 
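
The general idea of context alignment with a sliding window can be sketched as follows. This is a generic illustration, not the method of any of the cited authors: it assumes raw co-occurrence counts and cosine similarity, and the function names, the seed lexicon, and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def context_vector(term, tokens, window):
    """Count the tokens within `window` positions of each occurrence."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            vec.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return vec


def translate_vector(vec, seed):
    """Map a context vector into the other language via the seed lexicon."""
    out = Counter()
    for word, n in vec.items():
        for t in seed.get(word, ()):
            out[t] += n
    return out


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_candidates(unknown, src_tokens, tgt_tokens, seed, window=3):
    """Rank target terms lacking a known translation by context similarity."""
    src_vec = translate_vector(context_vector(unknown, src_tokens, window), seed)
    known_targets = {t for ts in seed.values() for t in ts}
    candidates = set(tgt_tokens) - known_targets
    scored = [
        (cosine(src_vec, context_vector(c, tgt_tokens, window)), c)
        for c in candidates
    ]
    return sorted(scored, reverse=True)


# Hypothetical seed lexicon and one-sentence "corpora".
seed = {"die": ["the"], "der": ["the"], "sank": ["sank"],
        "in": ["in"], "ostsee": ["baltic"]}
src = "die faehre sank in der ostsee".split()
tgt = "the ferryboat sank in the baltic sea".split()
ranking = rank_candidates("faehre", src, tgt, seed)
```

A hypothesized pair could then be added to the seed lexicon and the process repeated, which corresponds to the iteration described above.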

Although Sheridan and Schauble demonstrated improved
performance in cross-language retrieval when domain-specific
similarity thesauri obtained from context alignment at the
document level were used [18], we are not aware of any case
in which the idea of sliding windows for context alignment
has been applied in a CLIR application.

We  have  started  to  explore  the  potential  value  of  transla¬ 
tions  learned  from  comparable  collections  to  CLIR.  Picchi 
and  Peters  observed  that  although  homonymy  (terms  with 
different meanings that are written identically) precluded
effective use of the technique with news collections, it could
be  quite  useful  in  domain-specific  applications.  We  do  not 
yet  have  an  appropriate  evaluation  collection  that  is  domain- 
specific,  so  we  have  started  by  checking  to  see  whether  any 
benefit  might  be  found  in  the  collections  that  we  have  been 
working with. We implemented a technique that aligns unknown
terms according to the similarity of their translated contexts.
For each unknown term for which we wished to learn a translation,
we took windows of 3, 4, and 5 tokens immediately surrounding
the term and measured the degree of association between that
term and all terms in its context. We call our normalized
association measure [2] “affinity,” and calculate it as
follows:

$$A(w_1, w_2) = \frac{\mathrm{count}(w_1, w_2)}{\mathrm{count}(w_1) + \mathrm{count}(w_2) - \mathrm{count}(w_1, w_2)}$$
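
As a rough sketch, the affinity measure can be computed as shown below. The text does not specify the unit over which occurrences are counted, so this example assumes that each count is the number of regions (for example, paired context windows) containing the term or terms; under that assumption the measure has the form of a Jaccard coefficient. The function name and the toy regions are hypothetical.

```python
def affinity(w1, w2, regions):
    """Co-occurrence count normalized by the count of either term occurring.

    regions: iterable of sets of terms, one set per region (an assumption).
    """
    n1 = sum(1 for r in regions if w1 in r)
    n2 = sum(1 for r in regions if w2 in r)
    n12 = sum(1 for r in regions if w1 in r and w2 in r)
    denom = n1 + n2 - n12
    return n12 / denom if denom else 0.0


# Hypothetical regions for illustration only.
regions = [
    {"tumor", "cancer", "cell"},
    {"tumor", "cancer"},
    {"cancer", "risk"},
    {"ferryboat", "baltic"},
]
```

The value ranges from 0, when the two terms never share a region, to 1, when they always occur together.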

In  our  initial  experiments  with  news  stories,  we  have 
found  that  this  measure  tends  to  associate  domain-specific 
terms with synonyms and related terms in the other language;
in other words, it appears that the learned “translations”
might be useful for information retrieval. Moreover,
we observed that the set of “translations” associated with
relatively specific terms tends to be relatively small. For
example, the English terms with the strongest affinity to
the German word for tumor are cancer, oncology, diagnose,
risk, treatment, cause, cell, surgery, radiation, chemotherapy,
and leukemia. The term tumor does appear in one
query, so we computed the average precision for that query
without that term and with that term, finding some benefit.

With  general  vocabulary,  we  observed  that  terms  associ¬ 
ated  with  particular  events  in  our  collection  tended  to  have 
a high affinity with the terms that describe that event. For
example, Estline, Baltic, and Estonia have the highest affinity
with the word ferryboat (the collections contain stories
about  a  ferryboat  that  sank  in  the  Baltic  Sea).  At  this  point 
our  results  are  merely  suggestive,  of  course,  but  it  appears 
that  this  is  a  promising  direction  for  further  work. 

7  Conclusion 

We  have  shown  that  the  pattern  previously  observed  in 
ablation  studies  of  lexicon  size  on  retrieval  effectiveness  is 
discernible in the large number of bilingual term lists that we
obtained  from  the  Internet.  Bilingual  term  lists  containing  at 
least  30,000  unique  terms  in  the  query  language  were  found 
to  optimize  the  coverage  of  general  vocabulary,  although 
the  coverage  of  named  entities  was  found  to  be  highly  vari¬ 
able,  resulting  in  substantial  variations  in  retrieval  effective¬ 
ness  for  lexicons  of  similar  size.  We  found  that  named  enti¬ 
ties  make  important  contributions  to  retrieval  effectiveness 
when  searching  news,  but  we  noted  that  proper  handling 
of domain-specific terms may be more important in other
applications. We therefore also began to explore a strategy
for learning translations of domain-specific terminology
from comparable corpora.

Our  work  raises  some  interesting  new  questions,  perhaps 
the  most  important  of  which  is  whether  comparable  corpora 
can  be  shown  to  be  useful  in  domain-specific  CLIR  applica¬ 
tions.  In  addition  to  questions  of  retrieval  effectiveness,  im¬ 
portant  questions  about  computational  complexity  and  data 
sparseness  remain  to  be  explored.  The  National  Institute 
of  Informatics  (Japan)  has  created  a  Japanese-English  test 
collection  of  scientific  paper  abstracts  that  might  prove  to 
be a useful tool for exploring this question. Another question,
raised by Mayfield and McNamee’s recent work, is the
effect of using pre-translation query expansion in conjunction
with the techniques that we are exploring. By exploring
questions such as these, we hope to push the frontier of CLIR
research,  expanding  beyond  our  roots  in  retrieval  from  col¬ 
lections  of  news  stories  to  a  broad  range  of  applications  that 
reflect  the  rich  potential  of  this  technology. 

Appendix A. Bilingual term lists used in the experiment.

No.  Language                Lexicon size  Available from
 1.  Eskimo                           700  http://www.pageweb.com/kleekai/
 2.  Swahili                          957  http://www.freelang.com
 3.  Old English                    1,026  http://www.freelang.com
 4.  Indonesian                     1,070  http://www.freelang.com
 5.  Welsh                          1,096  http://www.freelang.com
 6.  Portuguese                     1,228  http://www.june29.com/IDP
 7.  Latin                          1,568  http://www.freelang.com
 8.  Finnish                        2,804  http://www.freelang.com
 9.  French                         2,898  http://www.june29.com/IDP
10.  Icelandic                      3,148  http://www.freelang.com
11.  Danish                         3,703  http://www.freelang.com
12.  Afrikaans                      4,185  http://www.freelang.com
13.  Italian                        4,860  http://www.june29.com/IDP
14.  Greek                          5,437  http://www.freelang.com
15.  Portuguese                     5,868  http://www.freelang.com
16.  Norwegian                      6,027  http://www.freelang.com
17.  German                         6,265  http://www.june29.com/IDP
18.  Spanish                        6,545  http://www.june29.com/IDP
19.  Dutch                          9,959  http://www.freelang.com
20.  Swedish                       10,052  http://www.freelang.com
21.  Italian                       13,475  http://www.wordgumbo.com
22.  Esperanto                     16,710  http://www.freelang.com
23.  French                        17,466  http://www.freelang.com
24.  French                        20,078  http://www.wordgumbo.com
25.  Spanish                       20,761  http://www.wordgumbo.com
26.  Italian                       28,087  http://www.freelang.com
27.  Russian                       31,725  http://www.freelang.com
28.  Spanish                       35,752  http://www.freelang.com
29.  Japanese                      54,112  http://www.freelang.com
30.  Hungarian                     63,164  http://www.freelang.com
31.  German                        89,046  http://www.freelang.com
32.  German                        97,038  http://www.quickdic.de/
33.  German                       103,166  http://www.tu-chemnitz.de/dict
34.  Chinese (LDC Version 2)      110,831  http://morph.ldc.upenn.edu
35.  Chinese (CETA)               193,297  MRM corporation, Kensington, MD

